Calibration of cognitive tests to address the reliability paradox for decision-conflict tasks

Standard, well-established cognitive tasks that produce reliable effects in group comparisons also lead to unreliable measurement when assessing individual differences. This reliability paradox has been demonstrated in decision-conflict tasks such as the Simon, Flanker, and Stroop tasks, which measure various aspects of cognitive control. We aim to address this paradox by implementing carefully calibrated versions of the standard tests with an additional manipulation to encourage processing of conflicting information, as well as combinations of standard tasks. Over five experiments, we show that a Flanker task and a combined Simon and Stroop task with the additional manipulation produced reliable estimates of individual differences in under 100 trials per task, which improves on the reliability seen in benchmark Flanker, Simon, and Stroop data. We make these tasks freely available and discuss both theoretical and applied implications regarding how the cognitive testing of individual differences is carried out.

2)) response time data. The Q-Q plots of the transformed data indicate a closer fit to a normal distribution. Combined with the higher W values, this suggests that the transformation was effective. Data are from the first 432 trials for each participant per task.
Note. Bayes Factors reflect how many times more likely the observed data are under the assumed model compared to the alternative models. BF1 compares the standard model to a model assuming no practice effect. BF2 compares the standard model to a model assuming a practice × conflict effect interaction. Blocks are based on cumulative sets of 48 trials (i.e., the first row reflects 96 trials, and the final row reflects the full dataset of 432 trials).
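As a rough illustration of the two ideas in this note, the sketch below computes a Bayes factor as a ratio of marginal likelihoods and lists the cumulative block sizes implied by the 48-trial step. The function names and the log marginal likelihood values are hypothetical, and the sketch does not show how the Bayes factors in the table were actually estimated.

```python
import math


def bayes_factor(log_ml_standard, log_ml_alternative):
    """Bayes factor in favour of the standard model.

    Inputs are log marginal likelihoods (assumed to be available from
    some model-fitting step that is not shown here).
    """
    return math.exp(log_ml_standard - log_ml_alternative)


def cumulative_block_sizes(block=48, first=96, last=432):
    """Cumulative trial counts for each table row, in steps of one block."""
    return list(range(first, last + 1, block))


# Hypothetical log marginal likelihoods, for illustration only.
bf = bayes_factor(-100.0, -102.0)
print(round(bf, 2))
print(cumulative_block_sizes())  # [96, 144, 192, 240, 288, 336, 384, 432]
```

A value above 1 favours the standard model, and a value below 1 favours the alternative it is compared against.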

Experiment 1
Results for each task in Experiment 1 are displayed in Supplementary Tables 3-10, where posterior samples were used to calculate median values and 95% credible intervals for the following: congruent and incongruent response times (RTs), the intercept (μ) and its standard deviation (SDμ), conflict effect (CE) and its standard deviation (SDid), measurement noise (SDn), trait precision (η), effect size (ES), and an estimation of the total number of congruent and incongruent trials required for adequate measurement (n).
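The posterior summaries described above can be sketched as follows. This is a minimal illustration that assumes the posterior samples for one parameter are available as a flat sequence of numbers; the function name and the simulated draws are hypothetical and are not taken from the authors' analysis code.

```python
import random
import statistics


def summarize_posterior(samples):
    """Return the 2.5%, 50%, and 97.5% quantiles of posterior samples.

    With n=40, statistics.quantiles returns 39 cut points spaced 2.5%
    apart, so the first, 20th, and last cut points give the lower bound,
    the median, and the upper bound of a 95% credible interval.
    """
    data = [float(x) for x in samples]
    cuts = statistics.quantiles(data, n=40, method="inclusive")
    return {"lower": cuts[0], "median": cuts[19], "upper": cuts[38]}


# Hypothetical posterior draws for a conflict effect, in seconds.
rng = random.Random(1)
ce_draws = [rng.gauss(0.06, 0.005) for _ in range(4000)]
summary = summarize_posterior(ce_draws)
print(summary)
```

The three values correspond to the bottom or top, middle, and remaining outer rows of a table that reports the median with its 95% credible interval.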
We begin by looking at Flanker and Flanker2, where for both tasks, reliability (r = .8) was achieved within 27 trials given the total number of trials in the experiment (i.e., 48 per task).
Flankon and Flankon2 also showed promising results, achieving reliability in 26 and 38 trials, respectively. However, these complex variants showed no real benefit over the basic Flanker task in terms of reliability, although there was some indication of increased effect sizes. In contrast to Flanker, the Simon task performed poorly, with 64 and 72 trials required to reach a reliability of r = .8 for Simon and Simon2, respectively. That is, adequate measurement in these tasks would require more trials than were presented. Stroopon and Stroopon2 fell between the aforementioned tasks, requiring 53 and 50 trials, respectively, to achieve reliability based on the 48 trials included in each task. In this instance, only a few more trials would be needed for reliable measurement. Both the conflict effect and η values were generally larger for the Flanker-based tasks (including Flankon) than for the Simon tasks, with the exception of Stroopon2. As a result, the tasks containing a Flanker component required fewer trials to obtain reliable measurement. Regarding Simon and Simon2, the conflict effect was slightly larger in the latter; however, Simon2 produced marginally lower reliability because measurement noise increased more than the individual differences in the conflict effect.
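The link between trait precision and the number of trials needed can be illustrated with a common signal-to-noise formulation. The sketch assumes that η is the ratio SDid / SDn, that trials are split evenly between congruent and incongruent conditions, and that reliability is the share of variance in the observed conflict effect that reflects true individual differences. This section does not state the formula, so the expression used by the authors may differ, and all numerical values below are hypothetical.

```python
import math


def reliability(eta, n_total):
    """Reliability of a conflict-effect estimate based on n_total trials.

    With n_total / 2 trials per condition, the difference between two
    condition means has noise variance 4 * SDn**2 / n_total. Dividing
    through by SDn**2 gives r = eta**2 / (eta**2 + 4 / n_total).
    """
    return eta**2 / (eta**2 + 4.0 / n_total)


def trials_for_reliability(eta, r=0.8):
    """Total trials (both conditions combined) needed to reach reliability r."""
    return 4.0 * r / ((1.0 - r) * eta**2)


# Hypothetical standard deviations in seconds, for illustration only.
sd_id_a, sd_n_a = 0.020, 0.050   # task A
sd_id_b, sd_n_b = 0.022, 0.065   # task B: noise grows more than SDid

eta_a = sd_id_a / sd_n_a
eta_b = sd_id_b / sd_n_b

print(math.ceil(trials_for_reliability(eta_a)))
print(math.ceil(trials_for_reliability(eta_b)))
```

Under these assumptions, a task whose measurement noise increases proportionally more than its individual differences has a lower η and therefore needs more trials, even if its conflict effect is larger.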
To add to these findings, Supplementary Fig. 4 provides the RT and accuracy results for Experiment 1. For the Flanker variants, the figures clearly show a conflict effect that remained similar across all versions of the Flanker task, aside from some minor slowing in the Flankon task: responses on incongruent trials were slower than responses on congruent trials. The larger effect for these tasks in comparison to the Simon-based tasks is also apparent. Again, there was some evidence of larger effects in the double-shot conditions than in the standard tasks. Additionally, although accuracy was not the key variable of interest, we report that it was higher in the congruent conditions than in the incongruent conditions for all tasks.
In sum, Supplementary Fig. 4 and Supplementary Tables 3-6 indicate that Flanker performed quite well on its own and that little additional benefit was found when it was combined with Simon (i.e., Flankon). Conversely, the Simon task alone was weaker, and reliability improved when it was combined with the Stroop task (i.e., Stroopon2), as shown in Supplementary Fig. 4 and Supplementary Tables 7-10.
quantiles of the posterior values (i.e., the middle row is the median estimate, and the top and bottom rows give the associated 95% credible interval). Results are displayed in seconds.

Experiment 2
Because the Flankon task failed to produce larger and more reliable conflict effects, we did not pursue it further. We proceeded to test Flanker2 alongside Simon2 and Stroopon2.
Supplementary Tables 11-13 display the same information as the equivalent tables for Experiment 1. Again, Flanker2 was highly promising, with Simon2 performing worst and Stroopon2 falling in the middle. Given the 48-trial block, reliability (r = .8) was achieved at 24 trials for Flanker2 and 47 trials for Stroopon2. As in Experiment 1, Simon2 failed to reach reliable measurement and would instead require 66 trials (i.e., more than the number presented in this experiment). Both the conflict effect and trait precision (η) remained strong for the Flanker task and were relatively good for Stroopon.
Together with the findings of Experiment 1, these results provided strong support for Flanker and Flanker2, and we continued to use them in the final experiment. It also became apparent that Stroopon was the next-best task, and therefore both versions of the task were also used in the final experiment. We also included Simon2 and Stroop2 as the component parts of Stroopon2.
Finally, the results of Experiment 2 suggested that requiring a double shot on every trial was not advantageous; thus, we returned to implementing the double shot on 1/3 of trials.

Participants
In total, 1066 participants were recruited for Experiment 1. We excluded 265 for failing the tutorial, 63 for exceeding the experiment's time limit, and 6 for incomplete data. A further 62 participants were excluded for low accuracy (< 60% overall), 5 for having too many anticipatory responses (> 10% of trials with RT < 0.1 s), and 6 for non-responding (> 10% of trials not completed within 4 s). The final sample size was 670 across eight experimental conditions.
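The three performance-based exclusion criteria can be expressed as a simple screening function. This is a minimal sketch: the function name, the use of None to mark a trial without a response, and the order in which the criteria are checked are assumptions made for illustration, not details taken from the authors' pipeline.

```python
def exclusion_reason(accuracy, rts, time_limit=4.0):
    """Return the first exclusion criterion a participant meets, or None.

    accuracy: overall proportion correct, between 0 and 1.
    rts: response times in seconds, one per trial; None marks a trial
         with no response (an assumed encoding).
    """
    n_trials = len(rts)
    # Anticipatory responses: RT below 0.1 s.
    anticipatory = sum(1 for rt in rts if rt is not None and rt < 0.1)
    # Non-responses: trials not completed within the time limit.
    non_responses = sum(1 for rt in rts if rt is None or rt > time_limit)

    if accuracy < 0.60:
        return "low accuracy"
    if anticipatory / n_trials > 0.10:
        return "anticipatory responses"
    if non_responses / n_trials > 0.10:
        return "non-responding"
    return None


# Hypothetical participant: 48 trials, five of them faster than 0.1 s.
example_rts = [0.05] * 5 + [0.55] * 43
print(exclusion_reason(0.92, example_rts))  # anticipatory responses
```

Because the function returns only the first criterion met, a participant who fails several criteria is counted once, under the earliest check.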
In Experiment 2, a total of 394 participants were recruited. We removed 144, 21, and 4 participants for failing the tutorial, exceeding the time limit, and providing incomplete data, respectively. Applying the same criteria as in Experiment 1, we excluded the data of 2 participants for low accuracy, 5 for too many anticipatory responses, and 5 for too many non-responses. This resulted in a final sample size of 213. Participants were required to have a human intelligence task (HIT; MTurk terminology for a task or study) approval rate above 95%.
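As a quick arithmetic check, the Experiment 2 exclusion counts reported above can be tallied against the final sample size. The sketch assumes the exclusion categories are mutually exclusive, so that each participant is counted in at most one of them.

```python
# Counts for Experiment 2 as reported in the text.
recruited = 394
exclusions = {
    "failed tutorial": 144,
    "exceeded time limit": 21,
    "incomplete data": 4,
    "low accuracy": 2,
    "anticipatory responses": 5,
    "non-responses": 5,
}

total_excluded = sum(exclusions.values())
final_sample = recruited - total_excluded
print(total_excluded, final_sample)  # 181 213
```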
In both experiments, participants received a baseline payment of $1.00 USD for attempting the study. Passing the tutorial and completing the entire experiment resulted in a $0.50 bonus, with an additional bonus between $0 and $1.00 based on performance (i.e., up to $2.50 in total). Approval for this research was granted by the University of Tasmania's Human Research Ethics Committee.

Design and Materials
For Experiments 1 and 2, participants completed the tutorial and experimental phase in one session lasting approximately 15 minutes. For participants who failed the tutorial, the total duration was approximately 10 minutes. Participants were randomly assigned to one of eight conditions in Experiment 1: (1) Flanker, (2) Flanker2, (3) Flankon, (4) Flankon2, (5) Simon, (6) Simon2, (7) Stroopon, (8) Stroopon2. In Experiment 2, they were randomly assigned to one of three conditions: (1) Flanker2, (2) Simon2, (3) Stroopon2. The tasks and their response requirements were identical to those of the final experiment (reported in the manuscript). An additional task, not described in the Method of the main text, was the Flankon task (see Supplementary Fig. 6 for illustrations). This task combines the Flanker and Simon tasks, resulting in a display similar to that of Flanker; however, the set of arrows is positioned to the right or left of the screen to incorporate an element of a Simon task. The task remained the same as Flanker: the aim was to respond based on the central arrow while ignoring the flanking arrows and their location. For Flankon2, there were two types of second response. A purple shield required a response based on the direction of the flanking arrows (left or right), while a yellow shield required a response based on the location of the arrows (left or right).
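The response rules for Flankon and Flankon2 described above can be sketched as a small data structure. The class, field, and function names are hypothetical and only illustrate the rules as stated; they are not taken from the task's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlankonTrial:
    target: str                   # direction of the central arrow: "left" or "right"
    flankers: str                 # direction of the flanking arrows
    location: str                 # side of the screen where the arrows appear
    shield: Optional[str] = None  # "purple", "yellow", or None (no second shot)


def first_response(trial):
    """The first response always follows the central arrow."""
    return trial.target


def second_response(trial):
    """Correct second response in Flankon2, or None if no shield appears."""
    if trial.shield is None:
        return None
    if trial.shield == "purple":
        return trial.flankers   # respond to the direction of the flanking arrows
    if trial.shield == "yellow":
        return trial.location   # respond to the location of the arrows
    raise ValueError(f"unknown shield colour: {trial.shield}")


trial = FlankonTrial(target="right", flankers="left", location="left", shield="purple")
print(first_response(trial), second_response(trial))  # right left
```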

Procedure
For both Experiment 1 and Experiment 2, the eligibility criteria and tutorial procedure were consistent with the final experiment, as described in the main text. The experimental component consisted of 4 games (i.e., 48 trials in total) in a single session. Once finished, participants were advised of the bonus they received and were given an MTurk completion code.
Supplementary Figure 6. Depiction of the Flankon task in Experiment 1. The task reflects a combined Flanker and Simon task. As in the Flanker task, responses depend on the central arrow, with the set of arrows presented on the left or right of the display: a) incongruent Flankon display, and b) congruent Flankon display. After the initial response, an enemy may appear with a shield, requiring a second shot to be made: c) second-shot Flankon trial following a response to the display in (a), where a purple shield requires a decision based on the direction of the flanking arrows (i.e., left), and d) a second-shot trial based on (b), where a yellow shield requires a decision based on location (i.e., left).